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EFFICIENT NONPARAMETRIC CONFORMAL 
PREDICTION REGIONS 

By Jing Lei*'^, James Robins^ and Larry Wasserman^'^ 
Carnegie Mellon University^ and Harvard University^ 

We investigate and extend the conformal prediction method due 
to Vovk, Gammerman and Shafer (2005) to construct nonparametric 
prediction regions. These regions have guaranteed distribution free, 
finite sample coverage, without any assumptions on the distribution 
or the bandwidth. Exphcit convergence rates of the loss function are 
established for such regions under standard regularity conditions. Ap- 
proximations for simplifying implementation and data driven band- 
width selection methods are also discussed. The theoretical properties 
of our method are demonstrated through simulations. 

1. Introduction. 

1.1. Prediction regions and density level sets. Consider the following pre- 
diction problem: we observe iid data Yi, . . . , Yn G M'^ from a distribution P 
and we want to construct a prediction region Cn — ^n^^i^ • • • 5 CI M 
such that 



for fixed < a < 1 where P = is the product probabiUty measure over 

the (n -|- l)-tuple (Yi, . . . ,Yn+i).^ This is equivalent to E [P{Cn)] > 1 — a 
where P{Cn) is the probabihty mass of the random set C„. In other words, 
Cn traps a future independent observation Yn+i ~ P with probability at 
least 1 — a. The random set Cn is called a (1 — a) -prediction region or 
a (1 — a)-tolerance region. In this paper we will use the name "prediction 
region" for consistency of presentation while "tolerance region" is often used 
as a synonym in the literature. 

Prediction is a major focus of machine learning and statistics although 
the emphasis is often on point prediction. Prediction regions go beyond 
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merely providing a point prediction and are useful in a variety of applications 
including quality control and anomaly detection. For example, suppose a 
sequence of items is being produced or observed. If one item falls out of the 
prediction region constructed from the previous samples, it indicates that 
this item is likely to be different from the rest of the sample and some further 
investigation may be necessary. 

Another application of prediction regions is data description and cluster- 
ing. Given a random sample from a distribution, it is often of interest to 
ask where most of the probability mass is concentrated. A natural answer 
to this question is the density level set L{t) = {y G M'' : p{y) > t}, where 
p is the density function of P. When the distribution P is multimodal, a 
suitably chosen t will give a clustering of the underlying distribution (Har- 
tigan, 1975). When t is given, consistent estimators of L(t) and rates of 
convergence have been studied in detail, for example, in Polonik (1995); 
Tsybakov (1997); Baillo, Cuestas- Alberto and Cuevas (2001); Baillo (2003); 
Cadre (2006); Willett and Nowak (2007); Rigollet and Vert (2009); Rinaldo 
and Wasserman (2010). It often makes sense to define t implicitly using the 
desired probability coverage (1 — a): 

(1.2) i(a) = inf|t : P(L(t)) > 1-a}. 

Let denote the Lebesgue measure on W^. If the contour {y : p{y) = t{a)} 
has zero Lebesgue measure, then it is easily shown that 

C(°) := L{t{a)) = argmin^(C7) , 

c 

where the min is over |C : P{C) > 1 — a}. Therefore, the density based 
clustering problem can sometimes be formulated as estimation of the mini- 
mum volume prediction region. 

The study of prediction regions has a long history in statistics; see, for 
example Wilks (1941); Wald (1943); Fraser and Guttman (1956); Chatterjee 
and Patra (1980); Di Bucchianico, Einmahl and Mushkudiani (2001); Cadre 
(2006); Li and Liu (2008). For a thorough introduction to prediction regions, 
the reader is referred to the books by Guttman (1970) and Aichison and 
Dunsmore (1975). In this paper we study a newer method due to Vovk, 
Gammerman and Shafer (2005) which we describe in Section 2. 

1.2. Validity and efficiency. Let Cn be a prediction region. There are two 
natural criteria to measure its quality: validity and efficiency. By validity 
we mean that C„ has the desired coverage for all P, whereas by efficiency 
we mean that C„ is close to the optimal prediction region 

imsart-aos ver. 2011/05/20 file: conformal_vl.tex date: November 8, 2011 



NONPARAMETRIC CONFORMAL PREDICTION 



3 



1.2.1. Validity. By definition, a prediction region Cn is a function of the 
sample (Yi, ■■.,Yn) and hence its coverage P{Cn) is a random quantity. To 
formulate the notion of validity of a prediction region, Fraser and Guttman 
(1956) defined (1 — a)-prediction regions with r-confidence for Cn satisfying 

(1.3) nP{Cn) > 1 - a) > r. 

However, evaluating the exact probabihty in the above definition is rarely 
possible. Most work on nonparametric prediction regions validate their meth- 
ods using an asymptotic version (Chatterjee and Patra, 1980; Li and Liu, 
2008): 

liminf P [P(C„) > 1 - a] > r. 

On the other hand, if a procedure C„ satisfies (1.1) for every distribution 
P on M*^ and every n, then we say that Cn is a distribution free prediction 
region or has finite sample validity. 

1.2.2. Efficiency. We measure the efficiency of Cn in terms of its close- 
ness to the optimal region C^°'\ Recall that if P has a density p with respect 
to Lebesgue measure ^, then the smallest region with probability content at 
least 1 — a is 

(1.4) C7(") = {y : p{y) > t{a)} , 

where t{a) is given by (1.2), provided that the contour {y : p{y) = t{a)} has 
zero measure. Since p is unknown, C^"^ cannot be used as an estimator but 
only as a benchmark in evaluating the efficiency. We define the loss function 
of Cn by 

(1.5) R{Cn) = /x(aAC(")) 

where A denotes the symmetric set difference. Such loss functions have been 
used, for example, by Chatterjee and Patra (1980) and Li and Liu (2008) in 
nonparametric prediction region estimation and by Tsybakov (1997); Rigol- 
let and Vert (2009) in density level set estimation. Since, 

^(C„AC(")) = /i(C„) - ^(C(°)) + 2/x(C(")\C„) > /i(C„) - m(C(")) , 

it follows that the symmetric difference loss gives an upper bound on the 
excess loss 

(1.6) £{Cn) = ^Ji{Cn)-^l{c^''^)■ 
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In Chatterjee and Patra (1980) and Li and Liu (2008), a prediction region 
Cn is called asymptotically minimal if 

(1.7) MC7„AC(")) ^ 0. 

However, such an asymptotic property does not specify the rate of con- 
vergence. While convergence rate results are available for density level sets 
estimation (see Tsybakov, 1997; Rigollet and Vert, 2009; Mason and Polonik, 
2009, for example), relatively less is known about prediction regions until 
recently (Cadre, 2006; Samworth and Wand, 2010). 

1.3. This paper. In this paper, we propose an efficient and easy to com- 
pute prediction region with finite sample validity and we study the rate of 
convergence of its loss. To be specific, we construct Cn such that: 

1. Cn satisfies (1.1) for all P and all n under no assumption other than 
lid. 

2. For any A > 0, there exist constants ci(A,p) and C2{p) independent of 
n, such that 

(1.8) P (^R{Cn) > ci{X,p) (^-^y^'^^ = 0{n~^), 

for density p satisfying some standard regularity conditions. 

3. For any y G W^, the computation cost of evaluating l(y G C„) is linear 
in n. In other words, checking to see if a point y is in the prediction 
region, takes linear time. 

The convergence rate of efficiency is described by the term (logn/nY^^P\ 
We give explicit formula of constant C2{p) in terms of the global smoothness 
and the local behavior of p near the contour at level t{a). Its near optimality 
is discussed for some important special cases. 

Our prediction region is obtained by combining the idea of conformal 
prediction (Vovk, Gammerman and Shafer, 2005) with density estimation. 
We first construct a conformal prediction region that is closely related to 
a kernel density estimator. The finite sample validity is inherited from the 
nature of conformal prediction regions. Then we show that such a region, 
whose analytical form may be intractable, is sandwiched by two kernel den- 
sity level sets with carefully tuned cut-off values. Therefore the efficiency of 
the conformal prediction region can be approximated by those of the two 
kernel density level sets. As a by-product, we obtain a kernel density level 
set that always contains the conformal prediction region, and hence also sat- 
isfies finite sample validity. This observation means that, most of the time, 

imsart-aos ver. 2011/05/20 file: conformal_vl.tex date: November 8, 2011 



NONPARAMETRIC CONFORMAL PREDICTION 



5 



a kernel density estimator will have near optimal efficiency, finite sample va- 
lidity, and even lower computational cost at the same time. In the efficiency 
argument, we refine the rates of convergence for plug- in density level sets 
first developed in Cadre (2006), which may be of independent interest. 

Our method involves one tuning parameter which is the bandwidth in 
kernel density estimation. We give two practical data driven approaches to 
choose the bandwidth and demonstrate the performance through simula- 
tions. 

1.4. Related work. Our main technique for constructing prediction re- 
gions is inspired by the conformal prediction method (Vovk, Gammerman 
and Shafer, 2005; Shafer and Vovk, 2008), a general approach for construct- 
ing distribution free, sequential prediction regions using exchangeability. Al- 
though in its original appearance, conformal prediction is applied to sequen- 
tial classification and regression problems (Vovk, Nouretdinov and Gammer- 
man, 2009), it is easy to adapt the method to the prediction task described 
in (1.1). We describe this general method in Section 2 and our adaptation 
in Section 3. 

In multivariate prediction region estimation, common approaches include 
methods based on statistical equivalent blocks (Tukey, 1947; Li and Liu, 
2008) and plug-in density level sets (Chatterjee and Patra, 1980; Cadre, 
2006). In methods based on statistical equivalent blocks, an ordering func- 
tion taking values in is defined and used to order the data points. Then 
one-dimensional tolerance interval methods (e.g. Wilks, 1941) can be ap- 
plied. Such methods usually give accurate coverage but the efficiency is hard 
to prove. In particular, Li and Liu (2008) proposed an estimator using the 
multivariate spacing depth as the ordering function. Such a method is com- 
pletely nonparametric, requiring no tuning parameter, and is adaptive to the 
shape of the underlying distribution if the density level sets are convex. How- 
ever, this method requires 0(n^+^) time to compute the indicator l(y G C„) 
for any given y, which is much higher comparing to methods based on plug-in 
density level sets. Moreover, it is not clear how this method performs when 
the level sets of underlying distribution are not convex. On the other hand, 
the methods based on plug-in density level sets (Chatterjee and Patra, 1980) 
gives provable validity and efficiency in asymptotic sense regardless of the 
shape of the distribution (Cadre, 2006), while requiring only 0{n) time to 
compute the indicator function. The potential of such estimators has been 
reported empirically in Di Bucchianico, Einmahl and Mushkudiani (2001): 
" ... in principle the method based on density estimation can perform very 
well if a proper bandwidth is chosen, ..." 
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Our approach, although originally inspired by conformal prediction, can 
be viewed as a combination of the ordering based method and the density 
based method, where the ordering function is given by the estimated density. 
This agrees with the simple fact that the best ordering function is just the 
density itself. To the best of our knowledge, this method is the first one with 
both finite sample validity and explicit convergence rates. 

There are other methods for multivariate prediction regions. For example, 
Di Bucchianico, Einmahl and Mushkudiani (2001) proposed to minimize the 
volume over a pre-specified class of sets while maintaining a minimum cov- 
erage under the empirical distribution. This method works well for common 
distributions whose level sets can be well approximated by regular shapes 
such as ellipsoids and rectangles. However, its performance depends crucially 
on the pre-specified sets which cannot be very rich (must be a Donsker class) , 
and hence cannot be guaranteed for arbitrary distributions. Moreover, the 
minimization problem may be non-convex and hence computationally inten- 
sive. 

The rest of this paper is organized as follows. In Section 2 we introduce 
conformal prediction. In Section 3 we describe a construction of prediction 
region by combining conformal prediction with kernel density estimator. The 
approximation result (sandwiching lemma) and asymptotic properties are 
also discussed in Section 3. Practical methods for choosing the bandwidth 
are given in Section 4 and simulation results are presented in Section 5. 
Some closing remarks and possible future works are given in Section 6. Some 
technical proofs are given in Section 7. 

2. Conformal prediction. We can construct a valid prediction region 
using a method from Vovk, Gammerman and Shafer (2005) and Shafer and 
Vovk (2008). Although their focus was on sequential prediction with covari- 
ates, the same basic idea can be used here. The method is simple: consider a 
"conformity measure" a{P,y), which measures the "conformity" or "agree- 
ment" of a point y with respect to a distribution P. Examples of such a 
function in the multivariate case include data depth (see Liu, Parelius and 
Singh, 1999, and references therein), and the density function. For other 
choices of conformity measure, see the book by Vovk, Gammerman and 
Shafer (2005). Given an independent sample Yi,...,Yn from P, we test the 

hypothesis that (li, ^n+i) *~ P using observation (Fi, y) for 

each y G M*^ and invert the test. The test statistic is constructed using a 
with P replaced by empirical distribution P. 

When (Yi, ...,Yn, Yn+i) is a random sample from P, let Pn+i be the corre- 
sponding empirical distribution, which is symmetric in the n + 1 arguments. 
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Let 

^ n+l 

vr„+i,i = — — 1 \ a{Pn+i,Yj) < a{Pn+i,Yi) . 

By symmetry, the sequence of random variables (c7(P„,+i, 1^) : 1 < i < n+l) 
are exchangeable and hence so are (7r„+i_j : 1 < i < n + 1). Let 

^ [(n + l)aj 

a = ; — . 

n + l 

Note that (1 + l/n)~^a <a<a and so a ~ a. Then, for any a E (0, 1), 

(2.1) P(vr„+i,i > 5) > 1 - a , 

since there are at least (1 — a){n + 1) such 7rn+i,i's satisfying 'Kn+i,i > 5. 
Let 

(2.2) C(")(yi,...,y„) = [y : (vr„+i,„+i|y^^^^ J > a} , 

where -Kn^i^n+ily _,_i=y is the random variable iTn+i,n+i evaluated at Yn+i = 
y. Then (2.1) implies that 

p(y„+i GC(")(yi,...,y„)) >i-«. 

Based on the above discussion, any conformity measure a can be used to 
construct prediction regions with finite sample validity, with essentially no 
assumptions on P. The only requirement is exchangeability of {7rn+i,i} which 
is satisfied if the sample is independent. 
In this paper we use 

(2.3) a{P,y)=p{y), 

that is, a density estimate evaluated at y. We show that such a choice is 
closely related to the plug-in density level set estimator and hence can be 
proved to be asymptotically minimal with explicit rate of convergence. 

3. Kernel density estimation. Let Y = (Yi, . . . , Yn). Define the aug- 
mented data aug(Y,y) = (Yi, . . . ,Yn,y). Let p be some density estimator 
that is defined for all n. For example, p could be a parametric estimator or a 
nonparametric estimator such as a kernel density estimator. The particular 
algorithm we propose is given in Figure 1. 

Recall that under the null hypothesis Hq : (li, y„, Y^+i) *~ P, the 
ranks of p{Yi) are exchangeable, and hence P(-7r(y) < a) < a. Hence, we 
have: 
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Algorithm 1: Conformal Prediction with Density Estimation 
Input: sample (Yi, ...,¥„), density estimator p, and level a. 
For every y: 

(a) Construct p from aug(Y,i/). 

(b) Compute cri, . . . , a„+i where ai = p{Yi) for i — 1, . . . ,n and an+i ~ p(y)- 

(c) Test the null hypothesis Hq : Yn+i ~ P by computing the statistic 

^ n + l 
i—l 

Output: (inverting the test) C'"' = {y : Tv{y) > 5}. 



Fig 1 . The algorithm for computing the predtctton region. 

Lemma 1. Suppose li, y„, l^+i is an independent random sample 
from P, then 



(3.1) P (^n+i G C(")) > 1 - a , 

for all probability measures P and hence C^") is valid. 

Remark 2. Note that the prediction region is valid (has correct finite 
sample coverage) without any smoothness assumptions on p. Indeed, the re- 
gion is valid even if P does not have a density. 

3.1. Conformal prediction with kernel density estimation. Now we turn 
to the combination of conformal prediction with kernel density estimator. 
For a given bandwidth hn and kernel function i^, let 

be the usual kernel density estimator. For now, we focus on a given band- 
width hn- The theoretical and practical aspects of choosing /i„ will be dis- 
cussed in Subsection 3.3 and Section 4, respectively. For any given y G W^, 
let Yn+\ = y and define the augmented density estimator 

' hi{n + 1) ^ \ hn J 

f n \ ^ , , 1 f u — y 

(3-3) =\—-^]pn{u) + TJ7zrT-r^K^ 



n+iy"^"' ' hi{n + i) \ hn 
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Now we use the conformity measure a{Pn+i,Yi) = phiYi) and the p- value is 

n+l 



IT 

n-\-^ 

i=l 

The resulting prediction region given by Algorithm 1 is C^"-* = {v ■ T^iv) ^ 
a}. 

Figure 2 shows a one-dimensional example of the procedure, which we 
will investigate in detail later. The top left plot shows a histogram of some 
data of sample size 20 from a two-component Gaussian mixture. The next 
three plots (top right, middle left, middle right) show three kernel density 
estimators with increasing bandwidth as well as the conformal prediction 
regions derived from these estimators with a = 0.05. Every bandwidth leads 
to a valid region, but undersmoothing and oversmoothing lead to larger 
regions. The bottom left plot shows the Lebesgue measure of the region as 
a function of bandwidth. The bottom right plot shows the estimator and 
prediction region based on the bandwidth whose corresponding conformal 
prediction region has the minimal Lebesque measure. 

3.2. An approximation. The conformal prediction region given by Algo- 
rithm 1 is closely related to the kernel density estimator. In this subsection 
we further investigate this connection and state the main approximation 
result, the sandwiching lemma, which provides simple characterization of 
the conformal prediction region in terms of plug- in kernel density level sets. 
The sandwiching lemma will also be useful in the study of efficiency of the 
conformal prediction regions. 

We first introduce some notation. Define the upper and lower level sets 
of density p at level t, respectively: 

(3.4) L{t) = {y:p{y)>t}, and L\t) = {y : p{y) < t} . 

The corresponding level sets of p„ are denoted Ln{t) and Ll^{t), respectively. 
Let 

(3.5) P|._^P„ + _l_i„ 

where P„ is the empirical distribution defined by the sample Y = (Yi, ...,Yn), 
and 6y is the point mass distribution at y. Define functions 

G{t) = P{L\t)), 

n 

Gn{t) = Pn{Li{t)) = ^ l{pn{Y,) < t), 

1=1 
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Bandwidth 



Fig 2. Top left: histogram of some data. Top right, middle left, and middle right show three 
kernel density estimators with increasing bandwidth as well as the conformal prediction 
regions derived from these estimators. Bottom left: Lebesgue measure as a function of 
bandwidth. Bottom right: estimator and prediction region from the bandwidth with smallest 
prediction region. 
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Gi{t) = pymy) < t) 

= {n+ (^p^ imY^ <t) + < t)j . 

The functions G, Gn and Gn defined above are the cumulative distribu- 
tion function (CDF) of p{Y) and its empirical versions with sample Y and 
aug(Y,?/), respectively. 

By (2.2) and Algorithm 1, the conformal prediction region can be written 

as 

(3.6) C(-) = {yGM'^:G^(p^(y))>5}. 

Let 1^(1), . . . , Y(„) be the reordered data so that p„(Y(i)), . . . ,Pn{Y{n)) are 
in ascending order. Let in^a = [{n- + !)«], and define the inner and outer 
sandwiching sets: 

L- = Ln (Pn(%,,))) 

and 

L+ = L„ (p„(l(.„,,)) - {nh^r^^K) , 

where if^K = sup^,^/ \K{u) — K{u')\. Then we have the following "sandwich- 
ing" lemma, whose proof can be found in Subsection 7.1. 

Lemma 3 (Sandwiching Lemma). Assume that ||i^||oo = -^(0), then 

(3.7) L;ca(")cL+. 

According to the sandwiching lemma, -L^ also guarantees distribution free 
finite sample coverage and it is easier to analyze. The inner region, L~, which 
is not much smaller than when n is large, generally does not have finite 
sample validity. We confirm this through simulations in Section 5. Next we 
investigate the efficiency of these prediction regions. 

3.3. Asymptotic properties. In this subsection we prove asymptotic effi- 
ciency of C^"^ and the sandwiching sets in terms of the convergence rates of 
their loss. 

Recall that the optimal prediction region at level 1 — a can be written as 

(3.8) C7(")=L(t(")) = {y:p(y)>t(-)}, 

where t^^^ is the cut-off value of the density function so that the probability 
mass in the lower level set is exactly a: 

(3.9) G(t(")) = ¥{p{Y) < t(")) = a. 
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This holds if we assume G is continuous at t^") so that the above equation 
imphes ¥(p{Y) > t^"^) = 1 — a. This is equivalent to assuming that the 
contour of p at value t^°'\ {y : p{y) = t^"^}, has zero measure under P. 

The inner and outer sandwiching sets L~ and are plug-in estimators 
of density level sets of the form: 

(3.10) L„(4-)) = {y:i?„(y)>tW}, 

where ti"^ = A^(^(^„,„)) for the inner set L" and tlt^ = Pn(^(i„_„)) - 

{nhfi))~^'ipK for the outer set L^. Here we can view as an estimate 
of t^'^\ In Cadre, Pelletier and Pudlo (2009) it is shown that, under regu- 
larity conditions of the density p, the plug-in estimators and L„(ti"'') 
using kernel density estimator are consistent with convergence rate 1 / \/nh^ 
for a range of hn- Here, we refine the results using a set of slightly modified 
conditions. 

Intuitively speaking, for any density estimator pn and cut-off values t^"^ , 
the plug-in density level set Ln(tn"^) is an accurate estimator of L(t*^")) if: 

1. The estimated density function, p„, is close to the true density p. 

2. The true density is not too flat around level i*^"^. 

3. The estimated cut-off value t^'^ is an accurate estimate of t^"\ 

The first condition has been extensively studied in the literature of nonpara- 
metric density estimation and sufficient conditions of convergence for kernel 
density estimators in various forms have been established. The second con- 
dition is more specific for density level set estimation. A common condition 
is the 7-exponent at level t^"), which is first introduced by Polonik (1995) 
and has been used by many others (see Tsybakov, 1997; RigoUet and Vert, 
2009, for example). The third condition is somewhat opposite to the second 
one. It essentially requires that the density function cannot be too steep near 
the true cut-off value. This turns out to be a natural condition whenever the 
density has bounded derivatives near the contour. We formalize this condi- 
tion through a "modified 7-exponent condition" which is detailed in Section 
3.3.2. 

3.3.1. Holder Classes of Densities. To study the efficiency of the predic- 
tion region, we need some smoothness condition on p. The Holder class is a 
popular smoothness condition in nonparametric inferences (Tsybakov, 2009, 
Section 1.2). Here we use the version given in Rigollet and Vert (2009). 

Let s = (si, Sd) be a d-tuple of non- negative integers and \ s\ = + 
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Sd- For any x G M'^, let / and -D* be the differential operator: 

ox^^ ■ ■ ■ ox/ 

Given /? > 0, for any functions / that are [/3J times differentiable, denote 
its Taylor expansion of degree [/3J at xq by 

Definition 4 (Holder class). For constants /3 > 0, L > 0, define the 
Holder class S(/3,L) to be the set of [/3\-times differentiable functions on 
such that, 

(3.11) |/(x)-/if(x)| <L||x-xo||^. 

3.3.2. The ^-exponent condition. For a density function p, and a level 
t G (0, IIpIIoo), the usual 7-exponent condition requires that there exists an 
eo > and ci > such that 

(3.12) /i({y : t < p{y) < t + e}) < Cie\ Ve G (0, eo). 

Condition (3.12) is essentially requiring that the density p(y) increases roughly 
at rate e^/''' when y moves away from the contour by an e distance. As a re- 
sult, a larger value of 7 corresponds to a faster change of the density p when 
moving away from the contour, hence it is easier to estimate the density 
level set. In this paper, we consider the modified 7-exponent condition: 

Definition 5 (Modified 7-exponent condition). We say a density func- 
tion p satisfies the modified j-exponent condition at level t, if there exist 
constants cq > and ci,C2 > 0, such that 

/QiQ\ ^ Pi{y : t_ <p{y) < t+}) ^, 

(3.13) ci < — < C2, V t - eo < t- < t+ < t + eo. 

{t+ - t-) ' 

The modified 7-exponent condition differs from the original definition in 
three aspects: 

1. First, it allows both sides of the interval to change within a neighbor- 
hood of t, which is stronger than (3.12). It does not allow the contour 
at level t to have positive measure. We note that if the contour at 
level t has positive measure, then the estimated level set has at least 
a constant loss unless the cut-off value is estimated without error. 
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2. Second, it does not only require an upper bound on the measure, but 
also a lower bound. Since the upper bound indicates that the density 
cannot be too flat around the contour, the lower bound does not allow 
the density to be too steep. This condition implies that the estimated 
cut-off value is close to the truth. It usually holds when the density 
is smooth enough around the contour. For example, when the contour 
at level t is smooth and the density p satisfies \p{y) — i| ~ S^^"' for all 
y that is 5 away from the contour and all 5 small enough (Tsybakov, 
1997). 

3. Moreover, in the modified condition, we use the measure induced by 
p, rather than the Lebesgue measure. This is a minor difference since 
we always have, for all t — eo < t- < t+ < t + eo, 

^ P{{y:t^<piy)<U}) ^ 

K{y-t-<piy)<u}) + 

3.3.3. Conditions on the Kernel. A standard condition on the kernel is 
the notion of /3-valid kernels. 

Definition 6 (/3-valid kernel). For any (3 > 0, a function K -.W^ ^ 
is a (3 -valid kernel if 

1. K is supported on [—1,1]'^. 

2. JK = 1. 

3. j\KY < oo, allr> 1. 

4. J y'K{y)dy = for all 1 < |s| < /3. 

In the literature, /3-valid kernels are usually used with Holder class of 
functions to derive fast rate of convergence. The existence of univariate /3- 
valid kernels can be found in (Tsybakov, 2009, Section 1.2). A multivariate 
^- valid kernel can be obtained by taking direct product of univariate /3-valid 
kernels. 



3.3.4. Asymptotic properties of estimated density level set. Consider the 
following assumptions: 

Assumption Al: 

(a) The density function p G V{f3,L), where V{P,L) is the class of all 
density functions that are in the Holder class $](/3,L). 

(b) The density p satisfies the modified 7-exponent condition at level t^"^ . 

(c) The density function p is uniformly bounded by a constant L. 
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Assumption A2: The bandwidth satisfies 

1 

(3.14) K ^ (^) . 
Assumption A3: The kernel K is /3-valid and ||i^||oo = -^(0). 

These assumptions extend those in (Cadre, Pelletier and Pudlo, 2009), 
where /3 = 1 is considered. Also Al(b) considered here is a local version. 

The next theorem states the quality of cut-off values used in the sand- 
wiching sets L~ and L^. 

Theorem 7. Let tl"-* = ^^(yj-j^ ^-j), where pn is the kernel density es- 
timator given by eq. (3.2), and y(j) and in,a o,re defined as in Section 3.2. 
Under assumptions A1-A3, for any A > 0, there exist constants A\, A'^^ 
depending only on p, K and a, such that 

(3.15) P (^14-) - tW| > A, (^) + = 0{n-'). 

We give the proof of Theorem 7 in Section 7.2. Theorem 7 is useful for 
establishing the convergence of the corresponding level set. Observing that 
(n/i^)""*^ = o((logn/n)^/(^''+'^)), it follows immediately that the cut-off value 
used in also satisfies (3.15). The next theorem gives the rate of conver- 
gence for plug-in level set estimators when the cut-off value satisfies (3.15). 

Theorem 8. Lett^^ be a random sequence which satisfies (3.15). Under 
A1-A3, for any A > 0, there exist constants B\, B'^ depending on p, K and 
a only, such that 

r (M(i„(e')Ac'")) > B, (f?) + C^l^) = o(„-) . 

By Theorem 7, the cut-off values used in L~ and both satisfy (3.15), 
so the convergence rate in Theorem 8 holds for L~ and L^. By Lemma 3, 
it also holds for 

Corollary 9. Under A1-A3, for any A > 0, there exists constant B\, 
B'^ depending on p, K and a only, such that, for all C G {C'*^"^ L~, 

( /37 1 \ 

MCAC(-)) > B, (i^l^) + B', (i^l^) " J = 0{n-\ 
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In the most common case /3 = 7 = 1, the term (log n/n)^'''/^^^"'"'^) domi- 
nates the convergence rate. If we further assume that the level set L{t^°^^) is 
star-shaped (or more generally, a union of star-shaped sets), then the rate 
given by Corollary 9 is near optimal, up to a logarithm term. Indeed, the 
rate in equation (3.16) is within a logarithm term of the minimax risk for 
density level set estimation as developed in Tsybakov (1997). But note that 
the problem considered here is harder than estimating density level set at 
a fixed level since the cut-off value is not known in advance and needs to 
be estimated. Indeed, the logarithm term comes from estimating We 
also note that the continuity condition on p is slightly different than that in 
Tsybakov (1997) where it is assumed that the density contour at the desired 
level is in a Holder class. But the same construction of the lower bound can 
be used under the global smoothness conditions Al(a) and Al(b). 

A minimax risk rate of the plug-in density level set at a fixed level has 
been developed by Rigollet and Vert (2009). Although the rate is similar as 
that obtained in this paper, the construction of the lower bound only applies 
to fixed cut-off values close to 1, and hence has only limited application to 
the range of a values of practical interest. 

4. Choosing the bandwidth. As illustrated in Figure 2, the efficiency 
of C^°^^ depends on the choice of The size of estimated prediction region 
can be very large if the bandwidth is either too large or too small. There- 
fore, in practice it is desirable to choose a good bandwidth in an automatic 
and data driven manner. In kernel density estimation, the choice of band- 
width has been one of the most important topics and many approaches have 
been studied; see Loader (1999) and Mammen et al. (2011) and references 
therein. Intuitively, a good density estimator p will likely lead to a good 
prediction region, and the dependence on n of the (near) optimal choice of 
hn in Theorem 8 is similar to that in the context of kernel density estima- 
tion. However, this is not quite the case (Samworth and Wand, 2010). The 
intuition is simple: For density estimation, a good bandwidth guarantees the 
accuracy of estimated density in the whole space, whereas for level sets it 
suffices to estimate the density accurately near the contour. 

We propose two practical methods to choose a good bandwidth from 
a given candidate set T-L = {/ii, . . . , /im}, based on the idea that a good 
prediction region has small Lebesgue measure; see Figures 3 and 4. The 
methods introduced here are applicable to any prediction region estimator 
C with finite sample validity. In both approaches, we compute the prediction 
region for each h £ Ti and choose the one with the smallest volume. To 
preserve finite sample validity, the first approach, described in Fig 3, uses a 
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Algorithm 2: Tuning with Bonferroni Correction 

Input: sample Y — (Yi, prediction region estimator C, and level a. 

1. Construct prediction sets {Ch ~ Ch{Yi, . ..,¥„) : h £ H} each at level 
1 — a/m, where m — 

2. Let h = argmin^ i^{Ch)- 

3. Return Cj^. 

Fig 3. Algorithm 2: bandwidth selection. 



Bonferroni correction. 

Proposition 10. If C satisfies finite sample validity for any h, then 
the estimated prediction region given by Algorithm 2 also satisfies finite 
sample validity. 

Proof. Using Bonferroni correction we have 

p(y„+i G c^) >p(y„+i G Ch, yh g n) 

heH 

>l-a, 

where the last inequality uses the fact that each Ch is a finite sample valid 
prediction region at level 1 — a/m. □ 

When m = \T-L\ is large, Algorithm 2 tends to be conservative since each 
single Ch has coverage 1 — a/m, which could be much bigger than the ideal 
(1 — a) region. The algorithm described in Figure 4 uses sample splitting 
and only sacrifices a constant rate of efficiency regardless of \T-L\. 

Proposition 11. IfC satisfies finite sample validity for all h, then ^, 
the output of Algorithm 3, also satisfies finite sample validity. 

Proof. Note that h is independent of Y2, result, 

^{PiCh,2))=^{^{P{Ch,2)\h)) 

>E (1 - a\h}j 
=1 — a . 

□ 
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Algorithm 3: Tuning With Sample Splitting 

Input: sample Y — {Yi, prediction region estimator C, and level a 

1. Split the sample randomly into two equal sized subsamples, Yi and Y2. 

2. Construct prediction regions {Ch,i '■ h £ T-L} each at level 1 — q, using 
subsample Yi. 

3. Let h — axgrnm^ fj,{Ch.i) ■ 

4. Return Cf^ ^, which is constructed using bandwidth h and subsample Y2. 



Fig 4. Algorithm 3: bandwidth selection. 

It is easy to see that these methods have smaU excess loss with high prob- 
abihty since, by construction, ^i{C) < fi{Ch*) + i^n, where h* is the best 
bandwidth that minimizes the excess loss £-{p,{Ch)) and is a negligible 
term, because for % dense enough, there exists hj G % such that hj h* . 
Although minimizing excess loss does not necessarily minimize the symmet- 
ric difference loss, a small excess loss itself is a desired property in practice 
and is also a necessary condition of small symmetric difference loss. However, 
a more detailed relationship between excess loss and symmetric difference 
loss requires extra conditions and we leave that for future research. 

5. Numerical example. A simple illustration of Algorithm 1 is pre- 
sented in Figure 2. Here we consider a two-dimensional Gaussian mixture, 
whose geometric structure allows a better visualization of the results. We 
also test the bandwidth selectors presented in Section 4. Due to the small 
value of a and limited sample size, we find Algorithm 3 more preferable than 
Algorithm 2. Thus we only present the results using bandwidth chosen by 
sample splitting. For example, when n = 200, 100 data points are used to 
select the bandwidth and the other 100 data points are used to construct 
the prediction region using the selected bandwidth. 

Table 1 shows the coverage and Lebesgue measure of the prediction region 
of level .90 over 1,000 repetitions. The coverage is excellent and the size of 
the region is close to optimal. Both the conformal region C^"-* and the outer 
sandwiching set gives correct coverage regardless of the sample size. It 
is worth noting that the inner sandwiching set L~ does not give the desired 
coverage, which suggests that decreasing the cut-off value in L+ is not merely 
an artifact of proof, but a necessary tuning. The observed excess loss also 
reflects a rate of convergence that supports our theoretical results on the 
symmetric difference loss. Taking C^"-* for example, in Corollary 9 we have 
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Table 1 

The simulation results for 2-d Gaussian mixture with a = 0.1 over 1000 repetitions. The 
Lebesgue measure of the ideal region ~ 28.02. 





Coverage 


Lebesgue 


Measure 




n = 200 


n = 1000 


n = 200 


n = 1000 




0.897 ±0.002 


0.900 ±0.001 


34.3 ±0.31 


31.1 ±0.15 


Ln 


0.882 ±0.001 


0.896 ±0.001 


34.1 ±0.22 


32.2 ±0.10 


T + 


0.900 ±0.001 


0.907 ±0.001 


36.9 ±0.21 


34.1 ± 0.10 



/3 = 7 = 1, d = 2, and 

V(log200)/200 
V^(loglOOO)/1000 ^ ' ' 

which agrees with the observed drop of average excess loss from 6 to 3 as n 
increased from 200 to 1,000. 

Figure 5 shows a typical realization of the estimators. In both panels, 
the dots are data points when n = 200. The left panel shows the conformal 
prediction region with sample splitting (blue curve) , together with the inner 
and outer sandwiching sets (red and green curves, respectively). Also plotted 
is the ideal region C*^°^ (the grey curve). It is clear that all three estimated 
regions captures the main part of the ideal region, and they are mutually 
close. On the right panel we plot a realization of the depth based approach 
from Li and Liu (2008). This approach does not require any tuning param- 
eter. However, it takes 0{n'^~^^) time to evaluate l(y G C) for any single y. 
In practice it is recommended to compute the empirical depth only for all 
the data points and use the convex hull of all data points with high depth 
as the estimated prediction region. As can be seen on the picture, such a 
convex hull construction misses the "L" shape of the ideal region. Moreover, 
the kernel density method is at least 1,000 times faster than the depth based 
method in our implementation even when n = 200. 

Figure 6 shows the effect of bandwidth on the excess loss based on a typi- 
cal implementation of conformal prediction, where the y axis is the Lebesgue 
measure of the estimated region. We observe that for the conformal predic- 
tion region C7("), the excess loss is stable for a wide range of bandwidth, 
especially that moderate undersmoothing does not harm the performance 
very much. An intuitive explanation is that the data near the contour is 
dense enough to allow for moderate undersmoothing. Similar phenomenon 
should be expected whenever a is not too small. Moreover, the selected 
bandwidth from the outer sandwiching set is close to that obtained from 
the conformal region. This observation may be of practical interest since it 
is usually much faster to compute L^. 
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10 



Optimal Set 

Outer Bound 

Inner Bound 




Data points not in region 
Data points in region 
convex hull of data poins in region 




Conformal Set 

• Data Point 



-2 



-2 



-2 



10 



-2 



10 



12 



Fig 5. Conformal prediction region (left) and the convex hull of the multivariate spacing 
depth based tolerance region (right), with data from a two-component Gaussian mixture. 

6. Conclusion. We have constructed a distribution free prediction re- 
gion by combining ideas from density estimation and conformal prediction. 
It can also be viewed as a combination of the statistically equivalent block 
methods and the density level set methods. The region is easy to compute 
and, under regularity conditions, is asymptotically near optimal. Even with- 
out the regularity conditions, the region retains its finite sample validity. 

The bandwidth tuning algorithm (Algorithm 3) used in our simulation 
resembles cross-validation, a popular device for kernel density estimators. 
In Algorithm 3, the comparison between candidate bandwidths is based on 
a direct evaluation of loss, that is, the Lebesgue measure of the estimated 
region. This feature yields both conceptually and computationally simple 
implementation which is also highly stable as observed in our simulation 
studies. Future topics of research in this aspect include understanding the 
theoretical properties of such a bandwidth selector, its connection with other 
approaches in the literature of density and level set estimation, and the 
performance under both excess loss as well as the symmetric difference loss. 

In current work we are studying nonparametric procedures that adapt 
to smoothness conditions. In principle it is possible to further develop this 
method to deal with nonparametric prediction with covariates or parametric 
models. 
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lOR,_{h/h,.) loS,(/l./'.„) 



Fig 6. Lebesgue measure of the conformal prediction region versus bandwidth for the Gaus- 
sian mixture data with n — 200 (left) and n — 1000 (right). Here hn = \/ (logn)/n. 



7. Proofs. 

7. 1 . Proof of Lemma 3. 

Proof of the sandwiching lemma. The proof is done via a direct 

aracterization of and L^. 

First, for each y G and i < in,a, we have 



n 



n+l 
>0. 



K 



y{i) - y 



h 



As a result, Gn{p^iy)) > in,a/{'n + 1) = a and hence y G C*^"^. 
Similarly, for each y ^ and i > in,a we have 



< 



n+l 
<0. 



[Ph 



Therefore, Gl{^n{y)) < {in,a - l)/{n + 1) < 5 and hence y ^ d^"\ □ 
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7.2. Proof of Theorem 7. 

Preliminaries. Recall that L^{t) is the lower level set of p at level t: {y G 
: p{y) < t}. The bias in the estimated cut-off level i^"^ can be bounded 
in terms of two quantities: 

Vn = snp\Pn{L'{t)) - P{L'{t))\, 
t>0 

and 

-^n — I \Pn pI I oo • 

Here Vn can be viewed as the maximum of the empirical process Pn — P 
over a nested class of sets, and Rn is the Lqo loss of the density estimator. 
As a result, Vn can be bounded using the standard empirical process and 
VC dimension argument, and Rn can be bounded using the smoothness of 
p and kernel K with a suitable choice of bandwidth. Formally, we provide 
upper bounds for these two quantities through the following lemma. 

Lemma 12. Let Vn, Rn be defined as above, then under assumptions Al- 
A3, for any A > 0, there exist constants Ai^\ and A2^x depending on A only, 
such that. 



Vn>MJ'^\=Oin->^) 



I1.A1 

n 



and 



K„>A,,.(!^)""U0(„-). 



Proof. First, it is easy to check that the class of sets {L^{t) : t > 0} are 
nested with VC (Vapnik-Chervonenkis) dimension 2 and hence by classical 
empirical process theory (see, for example, van der Vaart and Wellner, 1996, 
Section 2.14), there exists a constant Co > such that for all r/ > 

(7.1) P(F„ > ry) < Cqu^ exp(-nryV32). 



Let rf = A^y^ogn/n, we have 

P (Vn > A^/logn/n^ <Con'^ exp{-A^ logn/32) 
(7.2) =Con-^A'/-'^-^^ . 



The first result then follows by choosing Ai^x = a/32(A + 2). 
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Next we bound Rn- Let p = ^\pn], and 
By triangle inequality 

Rn < \ \Pn -p\\oo + Hp -P| loo- 
Due to a result of Gine and Guillou (2002) (see also equation (49) in Chapter 
3 of Prakasa Rao, 1983), under the assumptions Al(c) and A3, there exist 
constants Ci, C2 and iJo > such that have for all B > Bq, 



5n -pIIoo > Ben) <Ciexpi-C2B^logih-^)) 

(7.3) =Ci/i^^^'. 

On the other hand, by assumptions Al(a) and A3, for some constant C3 

(7.4) \\P - pWoc < Cshl 

We note that in the inequalities (7.2), (7.3) and (7.4) the constants Ci, 
i = 0, 3, depend on p and K only. Hence, 

(7.5) ^{\\P-P\\oo > {C; + B)en) < Cih^'''\ 
which concludes the second part by choosing 



A2,X = C3 + 



i2(3 + d)X 



C2 

□ 

Proof of Theorem 7. Let an = in,a/n = [{n + l)a\/n. We have 

|«n — a| < 1/n . 
Recall that the ideal level t*-"^ can be written as 

where the function G is the cumulative distribution function of p{Y), as 
defined in Subsection 3.2. By the modified 7-exponent condition the inverse 
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of G is well defined in a small neighborhood of a. When n is large enough, 
we can define t^°^"^ as 

=G-i(a„). 
Again, by the modified 7-exponent, 

ci|t(°") - < - = \an -a\< n'^. 

Therefore, for n large enough 

(7.6) < (cin)-^/^. 

Equation (7.6) allows us to switch to the problem of bounding \t^'^ — 
Recall that t^n^ = pn{Y(^i^ ^^). The key of the proof is to observe that 

4") = G-i(a„) := inf{t : G„(i) > a„} . 

Then it suffices to show that G^^ and G^^ are close at an- In fact, by 
definition of Rn we have for all t > 0: 

L\t-Rn)<^Li{t)CL\t + Rn). 
Applying the empirical measure P„ to each term in the above: 

Pn{L'{t - Rn)) < Pn{Li{t)) < P„(L^(t + Rn)). 

By definition of Vn, 

P{L\t - Rn)) -Vn< PniLiit)) < P{L^{t + Rn)) + K- 

By definition of G and Gn, the above inequality becomes 

G{t - Rn) -Vn< Gn{t) < G{t + Rn) + 

Let Wn = Rn + (2y„/ci)i/'^. Suppose n is large enough such that 



(ci\-, , / 2Ai,A /logn 



ci V n 



then on the event K < ^i^Ay 
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= G - (2K/ci)^/^) - G +an + Vn 

< an -Vn < an- 

where the last inequaUty uses the left side of the modified 7-exponent 
condition. Similarly, Gn{t^°'"^ + Wn) > an- Hence, for n large enough, if 
Vn < Ai^xy/ {log n)/n then, 

(7.7) < VFn- 

To conclude the proof, first note that 

'Ci\j ( ( log n\ 27 \ 



n / \ \ n 



Then we can find constant A'-, such that for all n large enough, 



1 • 



Let A\ = j42,a- Combining equations (7.6) and (7.7), on the event 

i\os.n\ 2i3+d / log n 

(7.9) En,x:=\Rn<Axi^] , Vn<Ai ' ^ 



we have, for n large enough. 



n / \ n 



n 



<Rn + (2q Vn)!/^ + 



1 



n 



1 



/lognVZ+'i ^ / 2Ai,A /lognV ^ /cr - 

~ \ n J \ ci \ n J \n 

p j_ 

, /logn\2'3+'* /logn\27 
7.10 <^A — — ^ 

\ n J \ n 

where the second last inequality is from the definition of -En, A f^d the last 
inequality is from the choice of A'^ The proof is concluded by observing 
P(£',^ a) ~ 0{n^'^), a consequence of Lemma 12. □ 
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7.3. Proof of Theorem 8. 

Proof of Theorem 8. In the proof we write t„ for t^rf* . Observe that 



(7.11) =/X [[pn >tn, P< t^"^}) + [{pn < tn, P > t^"^}) • 

Note that 

(7.12) {pn>tn, p C |t(°)_|t„_t(")|_i?„<p<tW}, 

and 

(7.13) [pn <tn, P> } C <p< + - tn\ + Rn} • 

Therefore 

L„(t„)AC7W 

(7.14) C {tH - \tn - -Rn<P< t^"^ + - tn\ + Rn} • 

Suppose n is large enough such that 



2^2,A ( ^ 1 + ( I < I eo A 



n / \ n 



2 ' 



where the constant ^2, A is defined as in Lemma 12 and is defined as in 
equation (7.8). Then on the event En,\ as defined in equation (7.9), applying 
Theorem 7 and condition (3.13) on the right hand side of (7.14) yields 



P (L„(t„) ACW) 

ti'^^ - \tn - t(^^\ - Rn 



<r 2 i^A /lognVZ+'i /lognV'^^^ 

M 1 
/logn\2/3+'* , /lognN 2 
7.15 <Ba — 

\ n ) \ n 

where B'-. are positive constants depend only on p, K, a and 7. □ 
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